1.
BMC Genomics ; 24(1): 266, 2023 May 18.
Article in English | MEDLINE | ID: covidwho-2321452

ABSTRACT

BACKGROUND: The prevalence of COVID-19 in recent years and its widespread impact on mortality, as well as on various aspects of life around the world, have made it important to study this disease and its viral cause. However, the very long sequences of this virus increase the processing time, computational complexity, and memory consumption required by the available tools to compare and analyze the sequences. RESULTS: We present a new encoding method, named PC-mer, based on k-mers and the physicochemical properties of nucleotides. This method reduces the size of encoded data by around 2 k times compared to the classical k-mer based profiling method. Moreover, using PC-mer, we designed two tools: 1) a machine-learning-based classification tool for coronavirus family members that can receive input sequences from the NCBI database, and 2) an alignment-free computational comparison tool for calculating dissimilarity scores between coronaviruses at the genus and species levels. CONCLUSIONS: PC-mer achieves 100% accuracy despite the use of very simple machine-learning classification algorithms. Taking dynamic programming-based pairwise alignment as the ground truth, we achieved a degree of convergence of more than 98% for coronavirus genus-level sequences and 93% for SARS-CoV-2 sequences using PC-mer in the alignment-free classification method. This performance suggests that PC-mer can serve as a replacement for alignment-based approaches in certain sequence analysis applications that rely on similarity/dissimilarity scores, such as searching sequences, comparing sequences, and certain types of phylogenetic analysis methods that are based on sequence comparison.


Subject(s)
COVID-19 , SARS-CoV-2 , Humans , SARS-CoV-2/genetics , Phylogeny , Sequence Analysis, DNA , Nucleotides/genetics , Base Sequence , Algorithms
2.
Gigascience ; 12, 2022 12 28.
Article in English | MEDLINE | ID: covidwho-2313424

ABSTRACT

BACKGROUND: Since the beginning of the coronavirus disease 2019 pandemic, there has been an explosion of sequencing of the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) virus, making it the most widely sequenced virus in history. Several databases and tools have been created to keep track of genome sequences and variants of the virus; most notably, the GISAID platform hosts millions of complete genome sequences and is expanding every day. A challenging task is the development of fast and accurate tools that are able to distinguish between the different SARS-CoV-2 variants and assign them to a clade. RESULTS: In this article, we leverage the frequency chaos game representation (FCGR) and convolutional neural networks (CNNs) to develop an original method that learns how to classify genome sequences, which we implement in CouGaR-g, a tool for the clade assignment problem on SARS-CoV-2 sequences. On a testing subset of GISAID, CouGaR-g achieved a 96.29% overall accuracy, while a similar tool, Covidex, obtained a 77.12% overall accuracy. As far as we know, our method is the first to use deep learning and FCGR for intraspecies classification. Furthermore, by using some feature importance methods, CouGaR-g allows the identification of k-mers that match SARS-CoV-2 marker variants. CONCLUSIONS: By combining FCGR and CNNs, we develop a method that achieves better accuracy than Covidex (which is based on random forest) for clade assignment of SARS-CoV-2 genome sequences, thanks in part to our training on a much larger dataset, with comparable running times. Our method, implemented in CouGaR-g, is able to detect k-mers that capture relevant biological information distinguishing the clades, known as marker variants. AVAILABILITY: The trained models can be tested online by providing a FASTA file (with 1 or multiple sequences) at https://huggingface.co/spaces/BIASLab/sars-cov-2-classification-fcgr. CouGaR-g is also available at https://github.com/AlgoLab/CouGaR-g under the GPL.


Subject(s)
COVID-19 , Deep Learning , Puma , Animals , SARS-CoV-2/genetics , Puma/genetics , Genome, Viral
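
As a rough illustration of the frequency chaos game representation mentioned in the abstract above, the following sketch counts k-mers into a 2^k x 2^k matrix. It is not taken from CouGaR-g: the function name `fcgr`, the `CORNER` assignment of nucleotides to bits, and the choice to treat the first base of a k-mer as the most significant bit are all assumptions made for this example.

```python
# Hypothetical corner convention (an assumption, not CouGaR-g's layout):
# each nucleotide contributes one bit to the column index and one to the row index.
CORNER = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}


def fcgr(sequence, k):
    """Return a 2^k x 2^k matrix of k-mer counts laid out as an FCGR-style grid."""
    size = 2 ** k
    matrix = [[0] * size for _ in range(size)]
    seq = sequence.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if any(base not in CORNER for base in kmer):
            continue  # skip k-mers containing N or other ambiguity codes
        row = col = 0
        for base in kmer:
            x, y = CORNER[base]
            col = (col << 1) | x  # first base ends up as the most significant bit
            row = (row << 1) | y
        matrix[row][col] += 1
    return matrix
```

The resulting matrix can be treated as a single-channel image, which is what makes a CNN a natural classifier for it.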
3.
2022 IEEE International Conference on Big Data, Big Data 2022 ; : 5182-5188, 2022.
Article in English | Scopus | ID: covidwho-2249032

ABSTRACT

The SARS-CoV-2 coronavirus is the cause of the COVID-19 disease in humans. Like many coronaviruses, it can adapt to different hosts and evolve into different lineages. It is well known that the major SARS-CoV-2 lineages are characterized by mutations that happen predominantly in the spike protein. Understanding the spike protein structure and how it can be perturbed is vital for understanding and determining if a lineage is of concern. These are crucial to identifying and controlling current outbreaks and preventing future pandemics. Machine learning (ML) methods are a viable solution to this effort, given the volume of available sequencing data, much of which is unaligned or even unassembled. However, such ML methods require fixed-length numerical feature vectors in Euclidean space to be applicable. Moreover, Euclidean space is not considered the best choice for classification and clustering tasks on biological sequences. For this purpose, we design a method that converts the protein (spike) sequences into a sequence similarity network (SSN). We can then use the SSN as an input for classical algorithms from the graph mining domain for typical tasks such as classification and clustering to understand the data. We show that the proposed alignment-free method is able to outperform the current state-of-the-art (SOTA) method in terms of clustering results. Similarly, we are able to achieve higher classification accuracy using the well-known Node2Vec-based embedding compared to other baseline embedding approaches. © 2022 IEEE.

4.
8th International Conference on Contemporary Information Technology and Mathematics, ICCITM 2022 ; : 113-118, 2022.
Article in English | Scopus | ID: covidwho-2248726

ABSTRACT

A worldwide epidemic has been caused by the new coronavirus (COVID-19). The high transmission rate of this pathogen requires early prediction and appropriate identification of mutations. Predicting this evolution will aid in the early detection of new strains and potentially facilitate the design of more effective antiviral therapies. However, SARS-CoV, MERS-CoV, and SARS-CoV-2 (the cause of COVID-19) are all challenging to predict because of the virus's polymorphic nature, which enables it to adapt and survive across species, so there is a strong need for prediction to characterize mutations using their genetic information. A working method based on deep learning has been proposed to identify unknown sequences of pathogens to mitigate this problem. This study aims to predict virus mutations, especially codon mutations, for six types of coronavirus (MERS-CoV, SARS-CoV-1, SARS-CoV-2, Alpha, Beta, Gamma). In this work, long short-term memory (LSTM) is used for base prediction as an alignment-free technique. This algorithm is applied to several coronavirus DNA sequences, where the k-mer technique is applied to segment the data to create a unique vocabulary. TF-IDF is then applied to the identified virus sequences. The results showed that this technique's predictive accuracy on this data set reached 99%. It should be noted that this model was developed in Python using the Keras library, which is part of the TensorFlow library. © 2022 IEEE.
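
The k-mer segmentation and TF-IDF weighting described in this abstract could look roughly like the sketch below. It is an illustration only, not the authors' Keras pipeline: the helper names `kmer_tokens` and `tfidf` are hypothetical, and the unsmoothed weight tf * log(N / df) is an assumption (library implementations often use smoothed variants).

```python
import math
from collections import Counter


def kmer_tokens(sequence, k):
    """Split a sequence into overlapping k-mer 'words'."""
    seq = sequence.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def tfidf(documents):
    """documents: list of token lists. Returns one {token: weight} dict per document."""
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # count each token once per document
    weights = []
    for tokens in documents:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            tok: (cnt / total) * math.log(n_docs / doc_freq[tok])
            for tok, cnt in counts.items()
        })
    return weights
```

With this weighting, a k-mer that occurs in every sequence gets weight zero, so the vectors emphasize k-mers that differ between sequences.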

5.
22nd International Workshop on Algorithms in Bioinformatics, WABI 2022 ; 242, 2022.
Article in English | Scopus | ID: covidwho-2055786

ABSTRACT

We apply Invertible Bloom Lookup Tables (IBLTs) to the comparison of k-mer sets originating from large DNA sequence datasets. We show that for similar datasets, IBLTs provide a more space-efficient and, at the same time, more accurate method for estimating the Jaccard similarity of the underlying k-mer sets, compared to MinHash, which is a go-to sketching technique for efficient pairwise similarity estimation. This is achieved by combining IBLTs with k-mer sampling based on syncmers, which constitute a context-independent alternative to minimizers and provide an unbiased estimator of Jaccard similarity. A key property of our method is that the data structures involved require space proportional to the difference of the k-mer sets and independent of the size of the sets themselves. As another application, we show how our ideas can be applied to efficiently compute (an approximation of) the k-mers that differ between two datasets, still using space only proportional to their number. We experimentally illustrate our results on both simulated and real data (SARS-CoV-2 and Streptococcus pneumoniae genomes). © Yoshihiro Shibuya, Djamal Belazzougui, and Gregory Kucherov.
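
For orientation, the sketch below shows the quantity being estimated (exact Jaccard similarity of two k-mer sets) and a bottom-s MinHash estimate, i.e. the baseline this abstract compares against. It does not implement IBLTs or syncmer sampling. All function names, the use of BLAKE2b as the hash, and the sketch size parameter `s` are assumptions for this example.

```python
import hashlib


def kmer_set(sequence, k):
    """Set of distinct k-mers in a sequence."""
    seq = sequence.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Exact Jaccard similarity |A ∩ B| / |A ∪ B|."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def _hash(kmer):
    # 64-bit hash of a k-mer; BLAKE2b is an arbitrary choice for this sketch
    return int.from_bytes(hashlib.blake2b(kmer.encode(), digest_size=8).digest(), "big")


def bottom_sketch(kmers, s):
    """Keep the s smallest hash values of a k-mer set."""
    return sorted(_hash(x) for x in kmers)[:s]


def minhash_jaccard(sketch_a, sketch_b, s):
    """Bottom-s estimate: share of the s smallest union hashes present in both sketches."""
    merged = sorted(set(sketch_a) | set(sketch_b))[:s]
    if not merged:
        return 1.0
    shared = set(sketch_a) & set(sketch_b)
    return sum(1 for h in merged if h in shared) / len(merged)
```

Note that each MinHash sketch has a fixed size `s` regardless of how similar the sets are, whereas the abstract's point is that the IBLT-based structure needs space proportional only to the set difference.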

6.
J Comput Biol ; 29(9): 1001-1021, 2022 09.
Article in English | MEDLINE | ID: covidwho-2017640

ABSTRACT

The comparison of DNA sequences is of great significance in genomics analysis. Although the traditional multiple sequence alignment (MSA) method is popularly used for evolutionary analysis, optimally aligning k sequences becomes computationally intractable as k increases, due to the intrinsic computational complexity of MSA. Although numerous k-mer alignment-free methods have been proposed, the existing methods may not truly capture the contextual structures of the sequences. In this study, we present a novel k-mer contextual alignment-free method (called kmer2vec), in which the sequence k-mers are semantically embedded into word2vec vectors, an essential technique in natural language processing. Consequently, the method converts each DNA/RNA sequence into a point in the high-dimensional word2vec space and compares DNA sequences in that space. Because the word2vec vectors are trained from the contextual relationship of k-mers in the genomes, the method may extract valuable structural information from the sequences and reflect the relationship among them properly. The proposed method is optimized on the parameters from word2vec training and verified in the phylogenetic analysis of large whole genomes, including coronavirus and bacterial genomes. The results demonstrate the effectiveness of the method on phylogenetic tree construction and species clustering. The method runs much faster than the MSA method, and the phylogenetic relationships constructed by kmer2vec are more accurate than those of the conventional k-mer alignment-free method. Therefore, this approach can provide new perspectives for phylogeny and evolution and make it possible to analyze large genomes. In addition, we discuss special parameterization in the k-mer word2vec embedding construction. An effective tool for rapid SARS-CoV-2 typing can also be derived by combining kmer2vec with clustering methods.


Subject(s)
Algorithms , COVID-19 , Base Sequence , Humans , Phylogeny , SARS-CoV-2/genetics , Sequence Analysis, DNA/methods
7.
Genomics ; 114(4): 110414, 2022 07.
Article in English | MEDLINE | ID: covidwho-1895509

ABSTRACT

Classification of viruses into their taxonomic ranks (e.g., order, family, and genus) provides a framework to organize an abundant population of viruses. Next-generation metagenomic sequencing technologies have led to a rapid increase in viral sequencing data, which require bioinformatics tools for taxonomic analysis. Many metagenomic taxonomy classifiers have been developed to study microbiomes, but it is particularly challenging to assign the taxonomy of diverse virus sequences, and there is a growing need for dedicated methods that are optimized to classify virus sequences into their taxa. For taxonomic classification of viruses from metagenomic sequences, we developed VirusTaxo using diverse (e.g., 402 DNA and 280 RNA) genera of viruses. VirusTaxo has an average accuracy of 93% at genus-level prediction in DNA and RNA viruses. VirusTaxo outperformed existing taxonomic classifiers of viruses in that it assigned taxonomy to a larger fraction of metagenomic contigs than other methods. Benchmarking of VirusTaxo on a collection of SARS-CoV-2 sequencing libraries and metavirome datasets suggests that VirusTaxo can characterize virus taxonomy from highly diverse contigs and provide a reliable decision on the taxonomy of viruses.


Subject(s)
COVID-19 , Viruses , Humans , Metagenome , Metagenomics/methods , Phylogeny , SARS-CoV-2/genetics , Viruses/genetics
8.
5th International Conference on Big Data Research, ICBDR 2021 ; : 42-49, 2021.
Article in English | Scopus | ID: covidwho-1784896

ABSTRACT

SARS-CoV-2, like any other virus, continues to mutate as it spreads, according to an evolutionary process. Unlike any other virus, the number of currently available sequences of SARS-CoV-2 in public databases such as GISAID is already several million. This amount of data has the potential to uncover the evolutionary dynamics of a virus like never before. However, a million is already several orders of magnitude beyond what can be processed by the traditional methods designed to reconstruct a virus's evolutionary history, such as those that build a phylogenetic tree. Hence, new and scalable methods will need to be devised in order to make use of the ever increasing number of viral sequences being collected. Since identifying variants is an important part of understanding the evolution of a virus, in this paper, we propose an approach based on clustering sequences to identify the current major SARS-CoV-2 variants. Using a k-mer based feature vector generation and efficient feature selection methods, our approach is effective in identifying variants, as well as being efficient and scalable to millions of sequences. Such a clustering method allows us to show the relative proportion of each variant over time, giving the rate of spread of each variant in different locations - something which is important for vaccine development and distribution. We also compute the importance of each amino acid position of the spike protein in identifying a given variant in terms of information gain. Positions of high variant-specific importance tend to agree with those reported by the USA's Centers for Disease Control and Prevention (CDC), further demonstrating our approach. © 2021 ACM.

9.
Brief Bioinform ; 22(2): 924-935, 2021 03 22.
Article in English | MEDLINE | ID: covidwho-1343628

ABSTRACT

In this paper, we present a toolset and related resources for rapid identification of viruses and microorganisms from short-read or long-read sequencing data. We present fastv as an ultra-fast tool to detect microbial sequences present in sequencing data, identify target microorganisms and visualize coverage of microbial genomes. This tool is based on the k-mer mapping and extension method. K-mer sets are generated by UniqueKMER, another tool provided in this toolset. UniqueKMER can generate complete sets of unique k-mers for each genome within a large set of viral or microbial genomes. For convenience, unique k-mers for microorganisms and common viruses that afflict humans have been generated and are provided with the tools. As a lightweight tool, fastv accepts FASTQ data as input and directly outputs the results in both HTML and JSON formats. Prior to the k-mer analysis, fastv automatically performs adapter trimming, quality pruning, base correction and other preprocessing to ensure the accuracy of k-mer analysis. Specifically, fastv provides built-in support for rapid severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) identification and typing. Experimental results showed that fastv achieved 100% sensitivity and 100% specificity for detecting SARS-CoV-2 from sequencing data; and can distinguish SARS-CoV-2 from SARS, Middle East respiratory syndrome and other coronaviruses. This toolset is available at: https://github.com/OpenGene/fastv.


Subject(s)
SARS-CoV-2/isolation & purification , Sequence Analysis/methods , Viruses/isolation & purification , Algorithms , Genes, Viral , SARS-CoV-2/genetics , Viruses/genetics
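
The idea of generating, for each genome in a collection, the k-mers that occur in no other genome can be sketched as follows. This is a simplified illustration and not UniqueKMER's actual implementation: the function name `unique_kmers` and its dictionary input are hypothetical, and the sketch ignores reverse complements and any additional filtering the real tool may apply.

```python
from collections import defaultdict


def unique_kmers(genomes, k):
    """genomes: {name: sequence}. Returns {name: set of k-mers found in that genome only}."""
    owners = defaultdict(set)
    for name, seq in genomes.items():
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            owners[seq[i:i + k]].add(name)  # record which genomes contain this k-mer
    unique = {name: set() for name in genomes}
    for kmer, names in owners.items():
        if len(names) == 1:
            unique[next(iter(names))].add(kmer)
    return unique
```

A read that contains one of a genome's unique k-mers is then evidence for that genome and for no other genome in the collection.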
10.
Infect Genet Evol ; 93: 104933, 2021 09.
Article in English | MEDLINE | ID: covidwho-1237810

ABSTRACT

COVID-19, a severe respiratory pneumonia, has raged all over the world, and a coronavirus named SARS-CoV-2 is responsible for this global pandemic. Despite intensive research into the origins of the COVID-19 pandemic, the evolutionary history of its agent SARS-CoV-2 remains unclear, although understanding it is vital to controlling the pandemic and preventing another round of outbreaks. Coronaviruses are highly recombinogenic, which alignment-based methods do not handle well. In addition, deletions have been found in the genomes of several SARS-CoV-2 isolates, which cannot be resolved with current phylogenetic methods. Therefore, the k-mer natural vector is proposed to explore hosts and transmission traits for SARS-CoV-2 using strict phylogenetic reconstruction. The clustering of SARS-CoV-2 with bat-origin coronaviruses strongly suggests that bats are the natural reservoir of SARS-CoV-2. By building a bat-to-human transmission route, the pangolin is identified as an intermediate host, and the civet is predicted as a possible candidate. We speculate that SARS-CoV-2 underwent cross-species recombination between bat and pangolin coronaviruses. This study also demonstrates the transmission mode and features of SARS-CoV-2 in the COVID-19 pandemic when it broke out early around the world.


Subject(s)
COVID-19/transmission , Host-Pathogen Interactions , Phylogeny , SARS-CoV-2/genetics , SARS-CoV-2/pathogenicity , Animals , Biological Evolution , COVID-19/epidemiology , China , Chiroptera/virology , Coronavirus/genetics , Genome, Viral , Pangolins/virology , Spike Glycoprotein, Coronavirus/genetics , Viral Zoonoses/transmission , Viverridae/virology
11.
Acta Math Sci ; 41(3): 1017-1022, 2021.
Article in English | MEDLINE | ID: covidwho-1198468

ABSTRACT

The severe acute respiratory syndrome COVID-19 was discovered on December 31, 2019 in China. Subsequently, many COVID-19 cases were reported in many other countries. However, some positive COVID-19 samples had been reported earlier than those officially accepted by health authorities in other countries, such as France and Italy. Thus, it is of great importance to determine the place where SARS-CoV-2 was first transmitted to humans. To this end, we analyze genomes of SARS-CoV-2 using the k-mer natural vector method and compare the similarities of global SARS-CoV-2 genomes by a new natural metric. Because it is commonly accepted that SARS-CoV-2 originated from bat coronavirus RaTG13, we only need to determine which SARS-CoV-2 genome sequence has the closest distance to bat coronavirus RaTG13 under our natural metric. From our analysis, SARS-CoV-2 most likely already existed in other countries, such as France, India, the Netherlands, England, and the United States, before the outbreak in Wuhan, China.
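
A k-mer natural vector records, for every k-mer, its count, its mean position, and a normalized second central moment of its positions. The sketch below illustrates that idea under stated assumptions: positions are 1-based, the second moment is divided by the k-mer count times the number of k-mer positions in the sequence, and a plain Euclidean distance stands in for the paper's "natural metric", whose definition the abstract does not give. The function names are hypothetical.

```python
import math
from collections import defaultdict


def kmer_natural_vector(sequence, k):
    """Return {kmer: (count, mean position, normalized second central moment)}."""
    seq = sequence.upper()
    total = len(seq) - k + 1  # number of k-mer positions in the sequence
    positions = defaultdict(list)
    for i in range(total):
        positions[seq[i:i + k]].append(i + 1)  # 1-based positions (assumption)
    vector = {}
    for kmer, pos in positions.items():
        n = len(pos)
        mu = sum(pos) / n
        # normalization by n * total is an assumption for this sketch
        d2 = sum((p - mu) ** 2 for p in pos) / (n * total)
        vector[kmer] = (n, mu, d2)
    return vector


def euclidean_distance(v1, v2):
    """Euclidean distance between two natural vectors; absent k-mers count as zeros."""
    zero = (0, 0.0, 0.0)
    return math.sqrt(sum(
        (a - b) ** 2
        for key in set(v1) | set(v2)
        for a, b in zip(v1.get(key, zero), v2.get(key, zero))
    ))
```

Finding the genome closest to a reference such as RaTG13 then amounts to computing this distance from the reference vector to each candidate vector and taking the minimum.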
